Summary

The following provides a description of single-cell RNA-seq analysis of motorneurons collected and sequenced from Crotalus atrox in 2018.

In this analysis, I examined expression at the *gene level* between locomotor and spinal cord cells to determine genes that were differentially expressed, using the tiger rattlesnake as a reference. We see 26 differentially expressed genes, which includes some interesting candidates, but no potassium channels, somewhat consistent in magnitude with an earlier analysis.

A final plot compares genes elected in another analysis for qPCR, showing that the ss-rnaseq data is consistent with those results.

Methods

Cell Collection

<BORIS NEEDS TO ADD METHODS DETAILS HERE>

Library preparation

Boris delivered 2 plates with 54 locomotor samples and 78 shaker samples to Alexander Janjic in May 2018 see (Boris_June2018.pdf). Elsewhere, the laboratory notes say that there were 54 locomotor cells and 62 shaker cells. In the notes, it says During centrifugation, 3 locomotor and 16 shaker samples were lost due to an accident. These numbers don’t add up.

Most plausibly, it looks like there should be 51 locomotor cells and 62 shaker cells in in the experiment.

Library was prepared using the mcSCRB-seq protocol v 1.1 (JWB Bagnoli et al. 2018).

i7 index N714 was used for locomotor cell plate, i7 index N715 was used for shaker cell plate.

Barcode sets per cell were unknown, requiring us to digitally demultiplex the cells…

Index Name Nextera Phenotype
GCTCATGA Boris1 N714 Locomotor
ATCTCAGG Boris2 N715 Shaker

Other laboratory specific notes are in the file Boris_June2018.pdf set from Alexander Janjic in June 2018.

Sequencing

The samples were then pooled and run on 0.5 lanes of XX instrument. (NEED MORE DETAILS FROM ALEXSANDAR, if he remembers…!)

Analysis

Raw single cell data was processed using the zUMIs pipeline (Parekh et al. 2018), which filters reads with low-quality barcodes and UMIs, and then uses the STAR aligner v2.7.3a (Dobin et al. 2012) to align reads to the Croatuls tigris genome (NCBI RefSeq GCF_016545835.1) (Margres et al. 2021). zUMIs then predicts cell barcodes and collapses UMIs to create a read count table for downstream analysis.

Read count tables were then analyzed using the (Hao et al. 2021) package to determine expression differences between cell pools.

All code for data analysis can be found at (https://github.com/msuefishlab/INSERT_REPO_NAME)

Results

Import ZUMIs Summary Data

rps<-read.table(file.path(root,"output_data/single_snakes/zUMIs_output/stats/single_snakes.readspercell.txt"),header = T)
tc<-data.frame(rps)
tc$type<-factor(tc$type)
tc$plate<-as.factor(substr(tc$RG,7,14))
tc$cell<-as.factor(substr(tc$RG,0,6))
levels(tc$plate)[1] <- "Bad"
levels(tc$plate)[2] <- "ATCTCAGG"
levels(tc$plate)<-c("Bad","Shaker","Locomotor")

featColors<-c("#1A5084", "#914614" ,"#118730","grey33","tan1","#631879FF","gold1","grey73","firebrick3")
names(featColors)<-c("Exon","Intron+Exon","Intron","Unmapped","Ambiguity","MultiMapping","Intergenic","Unused BC","User")


tc %>% 
  filter(plate != "Bad") %>%
  filter(type != "Unmapped") %>%
  ggplot(aes(y=N,fill=plate))+
        geom_density(alpha=0.7) +
        xlab("") +
        ylab("log10(Number of reads)") +
        scale_y_log10()+ 
        coord_flip()+
        scale_fill_manual(values=mycolors)+
        geom_hline(yintercept = 1e04)

Based on filtering out cells with < 10,000 reads, we get a number that looks similar to the number of input cells, with 10 more locomotor cells then we are supposed to have… let’s dig into this a bit more.

tc %>% filter(plate != "Bad" & N > 1e+04 & type != "Unmapped") %>%group_by(cell,plate) %>% summarise(Total=sum(N)) %>% group_by(plate) %>% summarise(n=n()) %>% kbl() %>% kable_styling()
plate n
Shaker 63
Locomotor 60

Import ZUMIs and Convert to Seurat Object

zumis_output <- readRDS(file.path(root,"output_data/single_snakes/zUMIs_output/expression/single_snakes.dgecounts.rds"))
umis <- as.matrix(zumis_output$umicount$inex$all)
seu <- Seurat::CreateSeuratObject(counts = umis)
seu@meta.data$celltype<-as.factor(substr(row.names(seu@meta.data),7,14))
levels(seu@meta.data$celltype)<-c("shaker","shaker","locomotor")

QC

Right out of the gate, zUMIs reduces the 192 possible barcodes from our file to to 162 potential cells (equally distributed between the two phenotypes). We know that there were less cells than this, suggesting that some of these cells represent “noise”…

seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 81
locomotor 82
Identify Ribosomal Genes

We can see right away that most cells are ~ 20% ribosomal RNA, but locomotor cells have a long tail, suggesting some “bad” cells with low ribosomal RNA in them…

#grep("^RP[LS]",rownames(seu@assays$RNA@counts),value = TRUE)
PercentageFeatureSet(seu,pattern="^RP[LS]|LOC120298527|LOC120298533") -> seu$percent.Ribosomal

seu@meta.data %>%
    ggplot(aes(x=percent.Ribosomal, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=12)

Calculate Percent Largest Gene
apply(
  seu@assays$RNA@counts,
  2,
  max
) -> seu$largest_count

apply(
  seu@assays$RNA@counts,
  2,
  which.max
) -> seu$largest_index

rownames(seu)[seu$largest_index] -> seu$largest_gene

100 * seu$largest_count / seu$nCount_RNA -> seu$percent.Largest.Gene
Calculate Read Counts

Again, we can see a pretty sharp bimodal distribution with a large number of cells < 10,000 reads and another large group > 10,000 reads…

seu@meta.data %>%
    ggplot(aes(x=nCount_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_x_log10()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=1.4e04)

Let’s look at the number of unique genes found in each sample. Here we can see that shaker cells have a strange distribution: one population has about 1000 unique features which is not present in locomotor cells at the same frequency. In the majority of cells we detect approximately 7500 unique features (genes)…

seu@meta.data %>%
    ggplot(aes(x=nFeature_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_x_log10()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=3000)

Filtering

Based on these values, we have a reasonable set of filters to propose:

filtered_seu <- subset(
  seu,
    nCount_RNA>1.4e04 &
    nFeature_RNA > 3000 &
    percent.Ribosomal > 12)

filtered_seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 45
locomotor 45

This gives us a total of 90 cells to examine, let’s take a look at the distributions of cells a bit more closely (turning off log scales now…)

filtered_seu@meta.data %>%
    ggplot(aes(x=nCount_RNA, color = celltype, fill=celltype)) +
    geom_density(alpha = 0.7) +
    theme_classic()+
    scale_fill_manual(values=mycolors)+
    geom_vline(xintercept=1.4e04)+
    geom_vline(xintercept=1.25e05)

What I see here is a bit of “ringing” in both samples with very high read counts, which I think are referred to as “doublets” in ss-RNAseq land. Let’s get rid of those now too!

filtered_seu <- subset(
  seu,
    nCount_RNA>1.4e04 &
    nFeature_RNA > 3000 &
    percent.Ribosomal > 12 &
    nCount_RNA<1.25e05 )

filtered_seu@meta.data %>% group_by(celltype) %>% summarise(n=n()) %>% kbl() %>%
  kable_styling()
celltype n
shaker 41
locomotor 34

This gives us a grand total of 75 cells. Let’s have a look at the distributions of our QC data in this filtered dataset.

Idents(filtered_seu) <- 'celltype'
qc_plots<-VlnPlot(filtered_seu,features=c("nFeature_RNA", "nCount_RNA","percent.Ribosomal"),alpha(0.5),combine=FALSE)
for(i in 1:length(qc_plots)) {
  qc_plots[[i]] <- qc_plots[[i]] + scale_fill_manual(values=mycolors)+ theme(legend.position = 'none')
}
CombinePlots(qc_plots)

The distributions of our QC data seem to mostly overlap and make good biological sense. Let’s proceeed!

Clustering

What are the most variable genes in our dataset, potentially prediciting what cell type cluster they belong to?

filtered_seu <- NormalizeData(filtered_seu, normalization.method = "RC", scale.factor = 1e6)
filtered_seu <- FindVariableFeatures(filtered_seu, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(filtered_seu), 10)
plot1 <- VariableFeaturePlot(filtered_seu)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2

ScaleData(filtered_seu,features=rownames(filtered_seu)) -> filtered_seu
RunPCA(filtered_seu,features=VariableFeatures(filtered_seu)) -> filtered_seu
DimPlot(filtered_seu,reduction="pca",group.by="celltype") + scale_color_manual(values=mycolors)

I’m not seeing a super clean separation between shaker and locomotor cells, but there is sort of a trend here…

Differential Expression

Idents(filtered_seu) <- 'celltype'
markers<-FindMarkers(filtered_seu,ident.1 = "shaker", min.pct = 0.25)
sig_genes<-subset(markers, p_val_adj<0.1)
sig_genes$id<-rownames(sig_genes)
sig_genes<-as.data.frame(sig_genes)


sig_list<-merge(sig_genes,subset(gtf,type=="transcript") %>% group_by(gene) %>% filter(!duplicated(gene)) %>% select(gene,product),by.x="id",by.y="gene",all.x=TRUE)

sig_list[,c(1,7,2,3,4,5,6)] %>% arrange(p_val_adj) %>% kbl() %>% kable_styling()
id product p_val avg_log2FC pct.1 pct.2 p_val_adj
LOC120298205 uncharacterized LOC120298205 0.0e+00 -6.587978 0.024 1.000 0.0000000
PDE7A phosphodiesterase 7A, transcript variant X1 0.0e+00 -4.613105 0.317 1.000 0.0000000
LPAR6 lysophosphatidic acid receptor 6 0.0e+00 -3.783138 0.073 0.912 0.0000001
LOC120317132 galectin-6-like 0.0e+00 -4.390212 0.073 0.824 0.0000010
LOC120318694 uncharacterized LOC120318694, transcript variant X1 0.0e+00 3.398786 0.927 0.441 0.0000022
LOC120319595 vomeronasal type-2 receptor 26-like 0.0e+00 -4.527950 0.098 0.794 0.0000048
LOC120303167 splicing factor SWAP, transcript variant X1 0.0e+00 2.144733 1.000 0.853 0.0000089
EPB41L3 erythrocyte membrane protein band 4.1 like 3, transcript variant X5 0.0e+00 -1.003199 1.000 1.000 0.0001171
TUBB tubulin beta class I 0.0e+00 -1.395067 0.927 1.000 0.0001508
AK5 adenylate kinase 5, transcript variant X1 0.0e+00 -2.764053 0.390 0.912 0.0001856
ABCC9 ATP binding cassette subfamily C member 9, transcript variant X2 0.0e+00 1.454360 1.000 0.941 0.0002655
LOC120319596 vomeronasal type-2 receptor 26-like 1.0e-07 -1.379046 0.878 1.000 0.0015383
IKZF1 IKAROS family zinc finger 1, transcript variant X2 2.0e-07 -2.272177 0.366 0.824 0.0043207
PTPRM protein tyrosine phosphatase receptor type M, transcript variant X1 2.0e-07 -1.218622 1.000 1.000 0.0049298
LOC120307612 tubulin beta-2 chain 3.0e-07 -1.238360 0.976 1.000 0.0069229
SDC2 syndecan 2 4.0e-07 -2.192644 0.488 1.000 0.0085474
SEZ6 seizure related 6 homolog, transcript variant X3 6.0e-07 1.452948 1.000 0.824 0.0120074
NALCN sodium leak channel, non-selective, transcript variant X3 1.3e-06 -2.108295 0.634 0.912 0.0263756
HOXD9 homeobox D9 1.3e-06 2.138052 0.902 0.412 0.0265779
LZTR1 leucine zipper like transcription regulator 1, transcript variant X2 2.5e-06 -1.564748 0.561 0.941 0.0509420
CREB5 cAMP responsive element binding protein 5, transcript variant X7 2.6e-06 -1.722398 0.707 0.971 0.0538470
TRNAA-AGC-9 NA 2.9e-06 -4.533654 0.000 0.441 0.0603690
TPPP3 tubulin polymerization promoting protein family member 3, transcript variant X1 3.0e-06 -1.260155 0.854 1.000 0.0630456
LOC120301339 5S ribosomal RNA 3.8e-06 -3.370541 0.244 0.765 0.0792824
TUBB2A tubulin beta 2A class IIa 4.0e-06 -1.463318 0.756 0.941 0.0825205
CNTROB centrobin, centriole duplication and spindle assembly protein, transcript variant X4 4.0e-06 2.071961 0.829 0.294 0.0839091

I’ve been able to identify 26 DE genes between shaker and locomotor cells. Of particular interest might be the following: NALCN (sodium leak channel) and splicing factor SWAP. Here are expression plots for all 26 genes.

myfeatures<-sig_list$id
plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=myfeatures,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.5) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Here are the genes examined in our qPCR assay:

goi<-c("HOXD9","KCNQ3","KCNQ2","LOC120319564")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Finally, CHAT appears to be a good marker for motorneurons, it’s highly expressed and seems to be uniformly abundant in both samples…

goi_motorneuron_markers<-c("CHAT")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Here are the additional genes requested by Max:

goi<-c("ABCC8","KCNJ8","KCNJ11")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

Here are the additional genes requested by Boris:

goi<-c("NALCN")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

## KV 7.2 and KV 7.3

goi_motorneuron_markers<-c("KCNQ3")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

goi_motorneuron_markers<-c("KCNQ2")

plots <- VlnPlot(filtered_seu,group.by="celltype", split.by = "celltype",features=goi_motorneuron_markers,combine=FALSE) 

for(i in 1:length(plots)) {
  plots[[i]] <- plots[[i]] + geom_boxplot(alpha=0.2) + theme(legend.position = 'none') + scale_fill_manual(values=mycolors) 
}
CombinePlots(plots)

New Heatmap Analaysis

kv_channels<-c("KCNH8","KCNH3","KCNH2","KCNH6","KCNH1","KCNH5","KCNQ3","KCNQ5","KCNQ4","KCNQ1","KCNQ2","KCNS3","KCNS1","KCNS2","KCNV1","KCNG1","KCNG4","KCNF1","KCNV2","KCNV3","KCND1","KCND2","KCND3","KCNB1","KCNB2","KCNC2","KCNC1","KCNC4","KCNC3","KCNA6","KCNA1","KCNA3","KCNA2","KCNA7","KCNA4","KCNA10","KCNA5")

kv_names<-c("Kv12.1","Kv12.2","Kv11.3","Kv11.1","Kv10.1","Kv10.2","Kv7.3","Kv7.5","Kv7.4","Kv7.1","Kv7.2","Kv9.3","Kv9.1","Kv9.2","Kv8.1","Kv6.1","Kv6.4","Kv5.1","Kv8.2","Kv6.3","Kv4.1","Kv4.2","Kv4.3","Kv2.1","Kv2.2","Kv3.2","Kv3.1","Kv3.4","Kv3.3","Kv1.6","Kv1.1","Kv1.3","Kv1.2","Kv1.7","Kv1.4","Kv1.8","Kv1.5")

kv_table <- data.frame(
  Names = kv_names,
  row.names = kv_channels
)

kv_filtered_seu<-subset(x=filtered_seu, features=kv_channels)

label_mapping <- setNames(kv_table$Names, rownames(kv_table))


#DoHeatmap(object=kv_filtered_seu,features=kv_channels,slot="counts")+  scale_fill_gradientn(colors=c("white", "yellow", "red"))+ scale_y_discrete(labels = label_mapping)

#VlnPlot(kv_filtered_seu,features=kv_channels, stack= TRUE, fill.by = "ident", flip=TRUE, sort="decreasing") +coord_flip() +scale_y_discrete(labels = label_mapping)

# Step 1: Calculate average expression values for each feature across all cell types
avg_expr <- AverageExpression(kv_filtered_seu, features = kv_channels, group.by = "celltype")
avg_expr <- avg_expr$RNA  # Replace 'RNA' with the appropriate assay name if different

# Calculate the average expression for each feature across all cell types
avg_expr_feature <- rowMeans(avg_expr, na.rm = TRUE)

# Step 2: Sort features by their average expression
sorted_features <- sort(avg_expr_feature, decreasing = TRUE)

# Step 3: Generate and sort plots
plots <- VlnPlot(kv_filtered_seu, group.by="celltype", split.by = "celltype", features=names(sorted_features), flip=TRUE, combine=FALSE, sort=TRUE)

# Number of columns in the combined plot
n_col <- 2

# Number of rows needed, assuming you fill column-wise
n_row <- ceiling(length(plots) / n_col)

# Generate and sort plots
plots <- VlnPlot(kv_filtered_seu, group.by="celltype", split.by = "celltype", features=names(sorted_features), flip=TRUE, combine=FALSE, sort=TRUE)

for(i in 1:length(plots)) {
  # Identify the last plot in each column
  last_in_columns <- seq(from = n_col * (n_row - 1) + 1, to = n_col * n_row, by = 1)
  
  # Extract the feature name from the plot title
  feature_name <- plots[[i]]$labels$title
  
  # Look up the corresponding protein name
  protein_name <- kv_table[feature_name, "Names"]
  
  # Remove x-axis and y-axis labels for all but the last plot in each column
  if (i %in% last_in_columns || i == length(plots)) {
    # Keep x-axis labels, remove y-axis labels and line
    plots[[i]] <- plots[[i]] + 
      ggtitle(paste0(protein_name,"(",feature_name,")")) +  # Replace title
      scale_fill_manual(values=mycolors) + 
      coord_flip() +
      ylim(0, 600) +
      scale_x_discrete(expand = c(0, 0)) +  # Start x-axis at 0
      theme(axis.line.y=element_blank(),  # Remove y-axis line
            axis.title.y=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank(),
            plot.title.position = "plot",
            plot.title = element_text(hjust = 0),
            legend.position="none")
  } else {
    # Remove both x-axis and y-axis labels and line
    plots[[i]] <- plots[[i]] + 
      ggtitle(paste0(protein_name,"(",feature_name,")")) +  # Replace title
      scale_fill_manual(values=mycolors) + 
      coord_flip() +
      ylim(0, 500) +
      scale_x_discrete(expand = c(0, 0)) +  # Start x-axis at 0
      theme(axis.line.y=element_blank(),  # Remove y-axis line
            axis.title.y=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank(),
            axis.title.x=element_blank(),
            axis.text.x=element_blank(),
            plot.title.position = "plot",
            plot.title = element_text(hjust = 0),
            axis.ticks.x=element_blank(),
            legend.position="none")
  }
}

# Combine the sorted plots
CombinePlots(plots, n_col)

References

Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2012. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” https://doi.org/10.1016/j.cell.2021.04.048.
JWB Bagnoli, Johannes, Christoph Ziegenhain, Aleksandar Janjic, Lucas Esteban Wange, Beate Vieth, Swati Parekh, Johanna Geuder, Ines Hellmann, and Wolfgang Enard. 2018. “mcSCRB-Seq Protocol V1.” http://dx.doi.org/10.17504/protocols.io.nrkdd4w.
Margres, Mark J., Rhett M. Rautsaw, Jason L. Strickland, Andrew J. Mason, Tristan D. Schramer, Erich P. Hofmann, Erin Stiers, et al. 2021. “The Tiger Rattlesnake Genome Reveals a Complex Genotype Underlying a Simple Venom Phenotype.” Proceedings of the National Academy of Sciences 118 (4). https://doi.org/10.1073/pnas.2014634118.
Parekh, Swati, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2018. “zUMIs - A Fast and Flexible Pipeline to Process RNA Sequencing Data with UMIs.” GigaScience 7 (6). https://doi.org/10.1093/gigascience/giy059.

  1. Michigan State University Department of Integrative Biology, ↩︎